Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework

نویسنده

Fabian Fier

چکیده

Motivated by increasing dataset sizes, various MapReducebased similarity join algorithms have emerged. In our past work (to appear), we compared nine of the most prominent algorithms experimentally. Surprisingly, we found that their runtimes become inhibitively long for only moderately large datasets. There are two main reasons. First, data grouping and replication between Map and Reduce relies on input data characteristics such as word distribution. A skewed distribution as it is common for textual data leads to data groups which reveal very unequal computation costs, leading to Straggling Reducer issues. Second, each Reduce instance only has limited main memory. Data spilling also leads to Straggling Reducers. In order to leverage parallelization, all approaches we investigated rely on high replication and hit this memory limit even with relatively small input data. In this work, we propose an initial approach toward a join framework to overcome both of these issues. It includes a cost-based grouping and replication strategy which is robust against large data sizes and various data characteristics such as skew. Furthermore, we propose an addition to the MapReduce programming paradigm. It unblocks the Reduce execution by running Reducers on partial intermediate datasets, allowing for arbitrarily large data sets between Map and Reduce.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Similarity analysis with advanced relationships on big data

Similarity analytic techniques such as distance based joins and regularized learningmodels are critical tools employed in numerous data mining and machine learning tasks. We focus on two typical techniques in the context of large scale data and distributed clusters. Advanced distance metrics such as the Earth Mover’s Distance (EMD) are usually employed to capture the similarity between data dim...

متن کامل

Efficient Large Outer Joins over MapReduce

Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...

متن کامل

H2RDF+: High-performance distributed joins over large-scale RDF graphs

The proliferation of data in RDF format calls for efficient and scalable solutions for their management. While scalability in the era of big data is a hard requirement, modern systems fail to adapt based on the complexity of the query. Current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this work...

متن کامل

Cost Based Multi-Way Equi-Join Optimization in MapReduce

MapReduce is a prominent programming model above shared nothing architecture for processing big data with a parallel, distributed algorithm on a cluster. Join is an important operation is very inefficient in MapReduce. In this work, a time cost based evolution model is proposed for multi-way join by considering the time cost calculation. A multi-way join consists of start pattern joins and chai...

متن کامل

Error-Tolerant Big Data Processing

Real-world data contains various kinds of errors. Before analyzing data, one usually needs to process the raw data. However, traditional data processing based on exactly match often misses lots of valid information. To get high-quality analysis results and fit in the big data era, this thesis studies the error-tolerant big data processing. As most of the data in real world can be represented as...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2017

Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework

نویسنده

چکیده

منابع مشابه

Similarity analysis with advanced relationships on big data

Efficient Large Outer Joins over MapReduce

H2RDF+: High-performance distributed joins over large-scale RDF graphs

Cost Based Multi-Way Equi-Join Optimization in MapReduce

Error-Tolerant Big Data Processing

عنوان ژورنال:

اشتراک گذاری